Abstract



We present a feature learning model that learns to encode relationships between images. The model
is defined as a Gated Boltzmann Machine, which is constrained such that hidden units that are nearby
in space can gate each other's connections. We show how frequency/orientation "columns" as well as
topographic filter maps follow naturally from training the model on image pairs. The model also
offers a simple explanation of why group sparse coding and topographic feature learning yield
features that tend to be grouped according to frequency, orientation and position but not according
to phase. Experimental results on synthetic image transformations show that spatially constrained
gating is an effective way to reduce the number of parameters and thereby to regularize a
transformation-learning model.



1 Introduction 

Feature-learning methods have started to become a standard component in many computer-vision
pipelines, because they can generate representations which are better at encoding the content of
images than raw images themselves. Feature learning works by projecting local image patches onto
a set of feature vectors (a.k.a. "filters"), and using the vector of filter responses as the
representation of the patch. This representation gets passed on to further processing modules like
spatial pooling and classification. Filters can be learned using a variety of criteria, including
maximizing sparseness across filter responses [18], minimizing reconstruction error [24],
maximizing likelihood [5], and many others. Under any of these learning criteria, Gabor features
typically emerge when training on natural image patches.

There has been increasing interest recently in imposing group structure on learned filters. During
learning, filters are encouraged to come in small groups, such that all members of a group share
certain properties. The motivation for this is that group structure can explain several biological
phenomena such as the presence of complex cells [6], it provides a simple way to model dependencies
between features (e.g., [9] and references therein), and it can make learned representations more
robust to small transformations, which is useful for recognition [10]. Filter grouping is also
referred to as "structured sparse coding" or "group sparse coding". Feature grouping can also be
used as a way to obtain topographic feature maps [8]. To this end, features are laid out in a
2-dimensional grid, and groups are defined on this grid such that each filter group shares filters
with its neighboring groups. In other words, groups overlap with each other. Training feature
grouping and topographic models on natural image patches typically yields Gabor filters whose
frequency, orientation and position are similar for all the filters within a group. Phase, in
contrast, tends to vary randomly across
the filters within a group (eg. fTlfTQllS I). 

Various approaches to performing group sparse coding and topographic feature learning have been 
proposed. Practically all of these are based on the same recipe: The set of filters is pre-partitioned
into groups before training. Then during learning, a second layer computes a weighted sum over
the squares of all filter responses within a group (eg., OIUITOl). The motivation for using square- 
pooling architectures to learn group structure is based in part on empirical evidence (it seems to work 
across many models and learning objectives). There have also been various attempts to explain it, 
based on a variety of heuristics: One explanation, for example, follows from the fact that some
nonlinearity must be used before pooling features in a group, because in the absence of a
nonlinearity, we would wind up with a standard (linear) feature learning model. The square is a
nonlinearity that is simple and canonical. It can also be shown that even-symmetric functions, like
the square, applied to filter responses, show strong dependencies. So they seem to capture a lot of
what we are missing after performing a single layer of feature learning [8]. Another motivation is
that Gabor features are local Fourier components, so computing squares is like computing spectral
energies. Finally, it can be shown that in the presence of an upstream square-root nonlinearity,
using squared features generalizes standard feature learning, since it degenerates to a standard
feature-learning model when using group size 1 [10]. For a summary of these various heuristics, see
[8] (page 215). Topographic grouping of frequency, orientation and position is also a well-known
feature of mammalian brains (e.g., [8]). Thus, another motivation for using square-pooling in
feature learning in general has been that, by means of replicating this effect, it may yield models
that are biologically consistent.

In this work, we show that a natural motivation for the use of squared filter responses can be
derived from the perspective of encoding relationships between images. In particular, we show that
the emergence of group structure and topography follows automatically from the computation of
binocular disparity or motion, if we assume that neurons that are nearby in space exert
multiplicative influences on each other. Our work is based on the close relationship between the
well-known "energy models" of motion and binocularity [1, 17] and the equivalent
"cross-correlation" models [3, 14]. It may help shed light on the close relationship between
topographic organization and motion processing as well as binocular vision. That way, it may also
help shed light on the phenomenon that topographic filter maps do not seem to be present in
rodents [20].

2 Factored Gated Boltzmann Machine 

While feature learning has been applied predominantly to single, static images in the past, there
has been increasing interest recently in learning features to encode relationships between multiple
images, for example, to encode motion (e.g., [11, 22]). We focus in this work on the Gated Boltzmann
Machine (GBM) [15, 22], which models the relationship between two binary images x and y using
the three-way energy function

$$E(x, y, h) = \sum_{ijk} w_{ijk} x_i y_j h_k \qquad (1)$$

The energy gets exponentiated and normalized to define the probability over image pairs (we drop
any bias-terms here to avoid clutter):

$$p(x, y) = \frac{1}{Z} \sum_{h} \exp\big(E(x, y, h)\big), \qquad Z = \sum_{x, y, h} \exp\big(E(x, y, h)\big) \qquad (2)$$

By adding appropriate penalty terms to the log-likelihood, one can extend the model to learn
real-valued images [15].

Since there is a product involving every triplet of an input pixel, an output pixel and a mapping
unit, the number of parameters is roughly cubic in the number of pixels. To reduce that number,
[16] suggested factorizing the three-way parameter tensor W with entries $w_{ijk}$ in Equation 1
into a three-way inner product:

$$w_{ijk} = \sum_{f=1}^{F} w^x_{if} \, w^y_{jf} \, w^h_{kf} \qquad (3)$$

Here, F is a latent dimension that has to be chosen by hand or by cross-validation. This form of
tensor factorization is also known as PARAFAC or "canonical decomposition" in the literature, and
it can be viewed as a three-way generalization of the SVD [2]. It is illustrated in Figure 1 (left).
The factorization makes use of a diagonal core tensor that contains ones along its diagonal and
that is zero elsewhere. Plugging in the factorized representation for W and using the distributive
law yields the factorized energy function:

$$E = \sum_{f} \Big( \sum_{i} w^x_{if} x_i \Big) \Big( \sum_{j} w^y_{jf} y_j \Big) \Big( \sum_{k} w^h_{kf} h_k \Big) \qquad (4)$$

Figure 1: Factorizing the parameter tensor of a Gated Boltzmann Machine (left) is equivalent to
gating filter responses instead of raw pixels (middle). In a group-gating model (right), filters are
grouped, and products of all filters within a group are passed on to the mapping units. In the example,
there are two groups, resulting in four unique products per group.

Inferring the transformation h, given two images, x and y, is efficient, because the hidden
variables are independent, given the data [16]. More specifically, we have
$p(h|x, y) = \prod_k p(h_k|x, y)$ with

$$p(h_k|x, y) = \sigma\Big( \sum_{f} w^h_{kf} \Big( \sum_{i} w^x_{if} x_i \Big) \Big( \sum_{j} w^y_{jf} y_j \Big) \Big), \qquad (5)$$

where $\sigma$ is the logistic sigmoid $\sigma(z) = \frac{1}{1 + \exp(-z)}$. Thus, to perform
inference, images are projected onto F basis functions ("filters") and those basis functions that
are in correspondence (i.e., that have the same index f) are multiplied. Finally, each hidden unit,
$h_k$, receives a weighted sum over all products as input. An illustration is shown in Figure 1
(middle, right).
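
As an illustration of the inference step, the following minimal NumPy sketch computes the mapping-unit probabilities of Eq. 5 and checks them against the same quantity computed from the explicit three-way tensor of Eq. 3. The toy sizes, the random parameters and the function names are assumptions made for this example only; they are not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_map, n_factors = 6, 6, 4, 5    # toy sizes (assumed)
Wx = rng.normal(size=(n_in, n_factors))       # input filters, one per column
Wy = rng.normal(size=(n_out, n_factors))      # output filters, one per column
Wh = rng.normal(size=(n_map, n_factors))      # factor-to-mapping weights w^h_{kf}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer_factored(x, y):
    """Eq. 5: project both images, multiply matching factors, pool."""
    fx = Wx.T @ x                  # filter responses of the input image
    fy = Wy.T @ y                  # filter responses of the output image
    return sigmoid(Wh @ (fx * fy))

def infer_full(x, y):
    """Same probabilities from the explicit tensor w_ijk of Eq. 3."""
    W = np.einsum("if,jf,kf->ijk", Wx, Wy, Wh)
    return sigmoid(np.einsum("ijk,i,j->k", W, x, y))

x = rng.integers(0, 2, size=n_in).astype(float)    # binary toy images
y = rng.integers(0, 2, size=n_out).astype(float)
p_fact = infer_factored(x, y)
p_full = infer_full(x, y)
```

Both routes give the same probabilities; the factored route never forms the cubic-size tensor.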

It is important to note that projections onto filters typically do not serve to reduce the dimensionality
of inputs. Rather, it is the restricted connectivity in the projected space that leads to the reduction in
the number of parameters. In fact, it is not uncommon to use a number of factors that is larger than
the dimensionality of the input data (for example, [16, 19]).
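
The parameter saving can be made concrete with a short count. The sketch below compares the full three-way tensor with the factorized form of Eq. 3 (biases ignored). The sizes (13 x 13 patches, 128 mapping units, 392 factors) are borrowed from the experimental section purely for illustration, and the helper names are hypothetical.

```python
def full_tensor_params(n_in, n_out, n_map):
    # one weight per (input pixel, output pixel, mapping unit) triplet
    return n_in * n_out * n_map

def factored_params(n_in, n_out, n_map, n_factors):
    # three factor matrices that share the latent dimension F
    return n_factors * (n_in + n_out + n_map)

n_pixels = 13 * 13
full = full_tensor_params(n_pixels, n_pixels, 128)
fact = factored_params(n_pixels, n_pixels, 128, 392)
print(full, fact)   # 3655808 182672
```

Note that the number of factors (392) exceeds the input dimensionality (169) here, yet the factorized model has roughly twenty times fewer parameters.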

To train the model, one can use contrastive divergence [16], score-matching or a variety of other
approximations. Recently, [21, 13] showed that one may equivalently add a decoder network,
effectively turning the model into a "gated" version of a de-noising auto-encoder [24], and train
it using back-prop. Inference is then still the same as in a Boltzmann machine. We use this approach
in most of our experiments, after verifying that the performance is similar to contrastive divergence.
More specifically, for a training image pair (x, y), let $p_h(x, y)$ denote the vector of inferred
hidden probabilities (Eq. 5) and let $W_x$, $W_y$, $W_h$ denote the matrices containing the input
filters, output filters, and hidden filters, respectively (stacked column-wise). Training now
amounts to minimizing the average reconstruction error
$(y - \hat{y}(x))^2 + (x - \hat{x}(y))^2$, with

$$\hat{y}(x) = W_y \Big( \big( W_h^{T} p_h(x, y) \big) * \big( W_x^{T} x \big) \Big), \qquad (6)$$

where * denotes element-wise multiplication. $\hat{x}(y)$ is defined analogously, with all
occurrences of x and y exchanged. One may add noise to the data during training (but reconstruct
the original, not noisy, data to compute the cost) [24]. We observed that this helps localize
filters on natural video, but on synthetic data (shifts and rotations) it is possible to train
without noise.
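
The reconstruction cost can be sketched as follows. This is a minimal forward-pass illustration of Eq. 5 and Eq. 6 under stated assumptions: $W_h$ is stored with one row per mapping unit and one column per factor, the corruption is additive Gaussian noise (the noise type is not specified in the text), and all function names are hypothetical. Gradients and the optimizer are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer(x, y, Wx, Wy, Wh):
    # Eq. 5: mapping-unit probabilities from products of filter responses
    return sigmoid(Wh @ ((Wx.T @ x) * (Wy.T @ y)))

def reconstruct_y(x, p_h, Wx, Wy, Wh):
    # Eq. 6: gate the input-filter responses with the mapping units, then decode
    return Wy @ ((Wh.T @ p_h) * (Wx.T @ x))

def reconstruct_x(y, p_h, Wx, Wy, Wh):
    # analogous reconstruction with the roles of x and y exchanged
    return Wx @ ((Wh.T @ p_h) * (Wy.T @ y))

def symmetric_cost(x, y, Wx, Wy, Wh, noise_std=0.0, rng=None):
    """Squared error of both reconstructions; targets are the clean images."""
    x_in, y_in = x, y
    if noise_std > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        # assumed corruption: additive Gaussian noise on the inputs only
        x_in = x + noise_std * rng.normal(size=x.shape)
        y_in = y + noise_std * rng.normal(size=y.shape)
    p_h = infer(x_in, y_in, Wx, Wy, Wh)
    y_hat = reconstruct_y(x_in, p_h, Wx, Wy, Wh)
    x_hat = reconstruct_x(y_in, p_h, Wx, Wy, Wh)
    return np.sum((y - y_hat) ** 2) + np.sum((x - x_hat) ** 2)

rng = np.random.default_rng(0)
Wx, Wy = rng.normal(size=(9, 6)), rng.normal(size=(9, 6))
Wh = rng.normal(size=(4, 6))
x, y = rng.normal(size=9), rng.normal(size=9)
cost_clean = symmetric_cost(x, y, Wx, Wy, Wh)
cost_noisy = symmetric_cost(x, y, Wx, Wy, Wh, noise_std=0.1, rng=rng)
```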

2.1 Phase usage in modeling transformations 

Figure 2 (left, top) shows filter-pairs that were learned from translated random-dot images using
a Factored Gated Boltzmann Machine (for details on training, see Section 4 below). Training on
translations turns filters into Fourier components, as was initially observed by [16]. The left-bottom
plot shows histograms over the occurrence of frequencies and orientations for input- and
output-image filters, respectively. These were generated by first performing a 2D-DFT on each
filter and then picking the frequency and orientation of the strongest component for the filter.
The histograms show that the learned filters evenly cover the space of frequencies and
orientations. This is to be expected, as all frequencies and orientations contribute equally to
the set of random translations (e.g., [4]).

Figure 2: Left, top: Filter pairs learned from translated random-dot images. (Input filters on the
left, output filters on the right). Left, bottom: Corresponding histograms showing the number
of filters per frequency/orientation bin. The size of each blob is proportional to the number of
filters in each bin. Right: Phase differences between learned filters. Each row corresponds to one
frequency/orientation bin. Colors are used to simplify visibility.
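
The extraction of a filter's dominant frequency and orientation can be sketched as below. This is a minimal version under assumptions: the DC component is ignored, frequency is measured in cycles per pixel, and orientation is taken modulo pi because a real-valued filter has two mirror-image peaks in its spectrum. The function name is hypothetical.

```python
import numpy as np

def dominant_frequency_orientation(filt):
    """Frequency and orientation of the strongest 2D-DFT component of a filter."""
    n_rows, n_cols = filt.shape
    amplitude = np.abs(np.fft.fft2(filt))
    amplitude[0, 0] = 0.0                          # ignore the mean (DC) term
    r, c = np.unravel_index(np.argmax(amplitude), amplitude.shape)
    f_vert = np.fft.fftfreq(n_rows)[r]             # cycles per pixel along rows
    f_horz = np.fft.fftfreq(n_cols)[c]             # cycles per pixel along columns
    frequency = np.hypot(f_horz, f_vert)
    orientation = np.arctan2(f_vert, f_horz) % np.pi
    return frequency, orientation

# Toy check on a synthetic diagonal grating with 2 cycles per axis on a 16 x 16 patch.
n = 16
rows, cols = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
grating = np.cos(2.0 * np.pi * (2 * cols + 2 * rows) / n)
freq, orient = dominant_frequency_orientation(grating)
```

Binning the returned pairs over all filters would give histograms of the kind shown in Figure 2 (left, bottom).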

It is also well-known, however, that multiple different translations will affect each frequency and
orientation differently. More specifically, any given translation induces a set of phase-shifts for
every component of a given frequency and orientation. The phase-shift depends linearly on frequency
[4]. Likewise, two different image translations of the same orientation will induce two different
phase-shifts for each frequency/orientation. In order to represent translations, it is necessary to
specify the phase at every frequency and orientation. This shows that mapping units in a GBM
must have access to multiple different phase-shifts at every frequency and orientation to be able to
represent translations. Also, there is no need to multiply filter responses of different frequencies
and/or orientations, only of different phases.
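
The linear dependence of the phase-shift on frequency is the Fourier shift theorem. The following one-dimensional sketch checks it numerically for a circular shift; the signal length, the shift and the use of a circular (wrap-around) translation are assumptions made to keep the example exact.

```python
import numpy as np

n = 32
rng = np.random.default_rng(0)
signal = rng.normal(size=n)
shift = 3                                   # translation in pixels
shifted = np.roll(signal, shift)            # shifted[m] = signal[m - shift]

freqs = np.fft.fftfreq(n)                   # cycles per pixel
# A translation multiplies each Fourier coefficient by a pure phase factor
# whose angle, -2*pi*f*shift, is linear in the frequency f.
phase_factor = np.exp(-2j * np.pi * freqs * shift)
spectrum_shifted = np.fft.fft(shifted)
spectrum_predicted = np.fft.fft(signal) * phase_factor
```

The amplitudes are unchanged by the shift; only the phases differ, and by a different amount at every frequency.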

As a consequence, for each input filter in Figure 2 (left, top) there need to be multiple phase-shifted
copies present in the output-filter bank, so that filters with varying phase differences can be matched
(which means the filters' responses will be multiplied with each other). The same holds for each
output filter. When the number of factors is small, the model has to find a compromise, for example, by
connecting hidden variables, $h_k$, to multiple phase-shifted versions of filters with the correct
phase-shift but with only a similar, not identical, frequency and orientation. Figure 2 (right)
depicts the occurrences of phase differences between input- and output-image filters. It shows that
the model learned to use a variety of phase differences at each frequency/orientation to represent
the training transformations. In the model, each phase difference corresponds to exactly one filter pair.

The analysis generalizes to other transformations, such as rotations or natural videos (e.g., [14]).
In particular, it also generalizes to Gabor features, which are localized Fourier features that emerge
when training the GBM on natural video (e.g., [22]).



3 Group gating 

The analysis in the previous section strongly suggests using a richer connectivity that supports the
re-use of filters. In this case the model can learn to match any filter $w^x_f$ with multiple
phase-shifted copies $w^y_f$ of itself rather than with a single one. All phase differences in
Figure 2 (right) can then in principle be obtained using a much smaller number of filters.



One way to support the re-use of filters to represent multiple phase-shifts is by replacing the
diagonal factorization (Eq. 3) with a factorization that allows for a richer connectivity:

$$w_{ijk} = \sum_{def} C_{def} \, w^x_{id} \, w^y_{je} \, w^h_{kf} \qquad (7)$$

where $C_{def}$ are the components of a (non-diagonal) core tensor C. Note that, if C is diagonal
so that $C_{def} = 1$ iff $d = e = f$, we would recover the PARAFAC factorization (Eq. 3). The energy
function now turns into (cf. Eq. 4):

$$E = \sum_{def} C_{def} \Big( \sum_{i} w^x_{id} x_i \Big) \Big( \sum_{j} w^y_{je} y_j \Big) \Big( \sum_{k} w^h_{kf} h_k \Big) \qquad (8)$$

and inference into

$$p(h_k|x, y) = \sigma\Big( \sum_{def} C_{def} \, w^h_{kf} \Big( \sum_{i} w^x_{id} x_i \Big) \Big( \sum_{j} w^y_{je} y_j \Big) \Big) \qquad (9)$$

As the number of factors is typically large, a full core tensor C would be computationally too
expensive. In fact, as we discuss in Section 2, there is no other reason to project onto filters
than reducing connectivity in the projected representation. Also, by the discussion in the previous
section, there is a very strong inductive bias towards allowing groups of factors to interact.

This suggests using a core tensor that allows features to come in groups of a fixed, small size, such
that all pairs of filters within a group can provide products to mapping units. By the analysis in
Section 2, training on translations or natural video is then likely to yield groups whose filters
have roughly constant frequency and orientation but differ with respect to phase. We shall refer to
this model as the "group-gating" model in the following. As the values $C_{def}$ may be absorbed into
the factor matrices and are learned from data, it is sufficient to distinguish only between non-zero
and zero entries in C, and we set all non-zero entries to one in what follows.

By defining the filter groups $\mathcal{G}_g$, $g = 1, \ldots, G$, we can consequently write
inference in the model as

$$p(h_k|x, y) = \sigma\Big( \sum_{g} \sum_{d \in \mathcal{G}_g} \sum_{e \in \mathcal{G}_g} w^h_{k,\, d \cdot |\mathcal{G}_g| + e} \Big( \sum_{i} w^x_{id} x_i \Big) \Big( \sum_{j} w^y_{je} y_j \Big) \Big) \qquad (10)$$

which is illustrated in Figure 1 (right). Note that each hidden unit can still pool over all pairwise
products between features within each group. The overall number of feature products is equal to the
number of factors times the group size (that is, the number of groups times the squared group size).
In practice, it makes sense to set the number of factors to be a multiple of the group size, so that
all groups can have the same size.
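
A minimal NumPy sketch of group-gating inference is given below. It assumes non-overlapping groups of consecutive filters and a particular flattening of the within-group products (input filter d paired with the e-th output filter of the same group is stored at index d times the group size plus e). The indexing convention, the toy sizes and the function names are assumptions of this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def group_gating_inference(x, y, Wx, Wy, Wh, group_size):
    """Wx, Wy: (pixels, F); Wh: (mapping units, F * group_size)."""
    n_filters = Wx.shape[1]
    n_groups = n_filters // group_size        # F is assumed divisible by the group size
    fx = (Wx.T @ x).reshape(n_groups, group_size)
    fy = (Wy.T @ y).reshape(n_groups, group_size)
    # all within-group products: entry [g, a, b] = fx[g, a] * fy[g, b]
    products = fx[:, :, None] * fy[:, None, :]
    return sigmoid(Wh @ products.reshape(-1))

rng = np.random.default_rng(0)
n_pix, n_map, n_filters, group_size = 9, 4, 6, 3
Wx = rng.normal(size=(n_pix, n_filters))
Wy = rng.normal(size=(n_pix, n_filters))
Wh = rng.normal(size=(n_map, n_filters * group_size))
x, y = rng.normal(size=n_pix), rng.normal(size=n_pix)
p_h = group_gating_inference(x, y, Wx, Wy, Wh, group_size)
```

With a group size of 1 the function reduces to the factored inference of Eq. 5, since each filter is then multiplied only with its single counterpart.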

A convenient way to implement the group-gating model is by computing all required products by
creating G copies of the input-factor matrix and G copies of the output-factor matrix and permuting
the columns (factors) of one of the two matrices appropriately. (Note in particular that when using
a large number of factors, masking, i.e. forcing a subset of entries of C to be zero, would not
be feasible.) It is possible to reduce the number of filters further by allowing for multiplicative
interactions between only one filter per group from the input image and all filters from the output
image (or vice versa). This leads to an asymmetric model, where the number of filters is not equal
for both images.
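
The copy-and-permute implementation can be sketched as follows. This example assumes one copy of each factor matrix per group member (our reading of the number of copies), and it uses a cyclic shift of the output filters within each group as the permutation; other permutations that enumerate all within-group pairs would work equally well. The function name and toy sizes are hypothetical.

```python
import numpy as np

def stack_permuted_factors(Wx, Wy, group_size):
    """Stack copies of Wx and column-permuted copies of Wy side by side."""
    n_filters = Wx.shape[1]
    idx = np.arange(n_filters).reshape(-1, group_size)   # filter indices per group
    wx_blocks, wy_blocks = [], []
    for c in range(group_size):
        # copy c pairs the a-th filter of a group with the ((a + c) mod size)-th one
        perm = np.roll(idx, -c, axis=1).reshape(-1)
        wx_blocks.append(Wx)
        wy_blocks.append(Wy[:, perm])
    return np.hstack(wx_blocks), np.hstack(wy_blocks)

rng = np.random.default_rng(0)
n_pix, n_filters, group_size = 9, 6, 3
Wx = rng.normal(size=(n_pix, n_filters))
Wy = rng.normal(size=(n_pix, n_filters))
x, y = rng.normal(size=n_pix), rng.normal(size=n_pix)

Wx_big, Wy_big = stack_permuted_factors(Wx, Wy, group_size)
# one element-wise product now yields every within-group pair exactly once
products = (Wx_big.T @ x) * (Wy_big.T @ y)
```

The stacked matrices let the group-gating model reuse the element-wise product of the factored model, so no mask over the core tensor is needed.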

3.1 Significance for square-pooling models 

It is interesting to note that the same analysis applies to the responses of energy-model complex cells
(e.g., [7, 19]), too, if images and filters are contrast normalized [14]. In this case, the response of the
energy model is the same as a factored GBM mapping unit applied to a single image, i.e. x = y (see,
for example, (191 HH). This shows that the gating interpretation of frequency/orientation groups and 
topographic structure applies to these models, too. 
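
The correspondence can be illustrated with a few lines of NumPy. The sketch assumes tied input and output filters (the same matrix W for both images) and omits contrast normalization; under these assumptions the input to a factored GBM mapping unit with y = x is a weighted sum of squared filter responses, which is the pooling stage of an energy model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_map, n_factors = 8, 3, 6            # toy sizes (assumed)
W = rng.normal(size=(n_pix, n_factors))      # shared filters, assumed Wx = Wy
Wh = rng.normal(size=(n_map, n_factors))     # pooling weights
x = rng.normal(size=n_pix)

# factored GBM mapping-unit input (Eq. 5 before the sigmoid) with y set to x
gbm_input = Wh @ ((W.T @ x) * (W.T @ x))
# square-pooling ("energy") input: weighted sum of squared filter responses
energy_input = Wh @ (W.T @ x) ** 2
```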

In the same way, we can interpret a square-pooling model applied to a single image as an encoding
of the relationship between, say, the rows or the columns within the patch. In natural images,
the predominant transformation that takes a row to the next row is a local translation. The
emergence of oriented Gabor features can therefore also be viewed as the result of modeling these
local translations.

Figure 3: Classification accuracy of a Gated Boltzmann Machine vs. a cRBM on shifts (left) and
rotations (right). A total of 25 different sets of parameters were tested, with the numbers of filters 
and mapping units being varied between parameter sets. 



4 Experiments 



Cross-correlation models vs. energy models: To get some quantitative insights into the
relationship between cross-correlation and energy models, we compared the standard Gated Boltzmann
Machine with a square-pooling GBM (e.g., [19, 13]) trained on the concatenation of the images. We
used the task of classifying transformations from the mapping units, using transformations for which
we can control the ground truth. The models are trained on image patches of size 13 x 13 pixels,
which are cropped from larger images to ensure that no border artifacts are introduced. Training,
validation and test data contain 100,000 cases each. We use logistic regression to classify
transformations from mapping units, where we determine the optimal number of mapping units and
factors, as well as the learning rates for the transformation models, on the validation data. Using
random images ensures that single images contain no information about the transformation; in other
words, a single image cannot be used to predict the transformation.

We compare the models on translations and rotations. Shifts are drawn uniformly at random in the
range [-3, 3] pixels. We used four different labels corresponding to the four quadrants of motion
direction. Rotations are drawn from a von Mises distribution (range [-π, π] rad), which we scaled
down to a maximal angle of 36°. We used 10 labels by equally dividing the set of rotation angles.
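
A shifted random-dot pair of the kind described above could be generated as in the following sketch. Several details are assumptions of this example and not taken from the text: the dots are binary with probability 0.5, shifts are restricted to integer pixels, and a zero shift along an axis is assigned to the non-negative side when forming the quadrant label. Cropping both patches from a larger canvas avoids border artifacts.

```python
import numpy as np

def make_shift_pair(rng, patch=13, max_shift=3):
    """Return (x, y, label, (dy, dx)) with y a translated view of x."""
    m = max_shift
    canvas = (rng.random((patch + 2 * m, patch + 2 * m)) < 0.5).astype(float)
    dy, dx = rng.integers(-m, m + 1, size=2)          # integer shifts (assumed)
    x = canvas[m:m + patch, m:m + patch]
    y = canvas[m + dy:m + dy + patch, m + dx:m + dx + patch]
    # quadrant label from the signs of the shift; zero counts as non-negative (assumed)
    label = int(dx >= 0) + 2 * int(dy >= 0)
    return x.ravel(), y.ravel(), label, (int(dy), int(dx))

rng = np.random.default_rng(1)
x, y, label, (dy, dx) = make_shift_pair(rng)
```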

The results are shown in Figure 3 and they demonstrate that both types of model do a reasonably
good job at predicting the transformations from the image pairs. The experiment verifies the
approximate equivalence of the two types of model derived, for example, in [3, 14]. The
cross-correlation model does show slightly better performance than the energy model for large
dataset sizes, and the difference gets more pronounced as the training dataset size is decreased,
which is also in line with
the theory. Energy models have been the standard approach to feature grouping in the past El. 

Learning simple transformations: We trained the group-gating model with group size 3, 128
mapping units and 392 filters on translating random-dot images of size 13 x 13 pixels. Figure 4
shows three pairs of plots, where the left, center and right pair depict the dominant frequencies,
orientations and phases of the filters, respectively. We extracted these from the filters using FFT.
Each pair shows properties of input (top) and output filters (bottom).

Within an image, each filter is represented by a single uni-colored square. For frequency plots,
the squares are in gray-scale with the brightness corresponding to the frequency of the filter (black
represents frequency zero, white is the highest frequency across all filters); in the orientation and
phase plots, the squares differ in color, where the angle determines the color according to the HSV
representation of RGB color space. The figures confirm that filter-groups tend to be homogeneous
with respect to frequency and orientation.¹ In contrast, phases differ within each group, as
expected. The same can be seen in Figure 5 (left and middle), which shows subsets of filter groups
learned from translations and rotations of random images, using patch size 25 x 25 and group size 5.



¹ In some groups, there appear to be some outliers, whose frequency or orientation does not match the other
filters in the group. These are typically the result of spurious maxima in the FFT amplitude due to small
irregularities in the filters.

Figure 4: Frequency (left), orientation (middle) and phase (right) of filters trained on translated
image pairs, with filters applied to the input images in the top and filters for the output image on the
bottom. A group size of 3 is used here, which means each three consecutive input filters and their
corresponding output filters form a group. Notice that frequency and orientation are often identical
for all filters within a group, whereas the phase varies. Best viewed in color.

Figure 5: Filter-groups learned from translations (left) and rotations (middle) using a group-gating
model with group size 5. Each group corresponds to one column in the plot. Right: Filter-groups
learned from natural videos, using a group-gating model with group size 4. Only output filters are
shown. Frequency, orientation and position are constant, while phase varies, within each group.



Learning transformation features from natural videos: Figure 5 (right) shows a subset of filter
groups learned from about one million patches of size 12 x 12 pixels that we crop from the van
Hateren broadcast television database [23]. We trained the asymmetric model to predict each frame
from its predecessor. All patches were PCA-whitened, retaining 95% of the variance. We trained
a model with 128 mapping units and 256 filters, and we used a group size of 4. The figure shows
that the model learns Gabor features, where frequency, orientation and position are nearly constant
within groups, and the set of possible phases is covered relatively evenly per group.
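
The whitening step can be sketched as follows: a minimal PCA-whitening routine that keeps the leading principal components explaining a given fraction of the variance. The small regularizer `eps`, the function name and the toy data are assumptions of this example; the preprocessing actually used may differ in such details.

```python
import numpy as np

def pca_whiten(patches, retain=0.95, eps=1e-8):
    """patches: (n_samples, n_pixels). Returns whitened data, projection and mean."""
    mean = patches.mean(axis=0)
    centered = patches - mean
    cov = centered.T @ centered / centered.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                 # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / np.sum(eigvals)
    # smallest number of components whose cumulative variance reaches `retain`
    n_keep = min(int(np.searchsorted(explained, retain)) + 1, eigvals.size)
    projection = eigvecs[:, :n_keep] / np.sqrt(eigvals[:n_keep] + eps)
    return centered @ projection, projection, mean

rng = np.random.default_rng(0)
mixing = rng.normal(size=(10, 10))
data = rng.normal(size=(2000, 10)) @ mixing           # correlated toy "patches"
whitened, projection, mean = pca_whiten(data, retain=0.95)
```

After whitening, the retained components are uncorrelated and have unit variance.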

Quantitative results for group-gating: Classification accuracies for the same model and data as
described in Section 4 are reported in Table 1. The equivalent number of group-gating filters is
shown in parentheses, where equivalence means that the group-gating model has the same number
of parameters in total as the factored GBM. This makes it possible to obtain a comparison
that is fair in terms of pure parameter count by comparing performance across the columns of the
tables. However, the table also shows that even along columns, the group-gating model robustly
outperforms the factored GBM, except when using a very small number of factors. In this case, the
parameter-equivalent number of 44 factors is probably too small to represent the transformations
with a sufficiently high resolution.



4.1 Topographic feature maps from local gating 

We also experimented with overlapping group structures by letting the groups $\mathcal{G}_g$ share
filters. This makes it necessary to define which filters are shared among which groups. A convenient
approach is to lay out all features ("simple cells") in a low-dimensional space and to define groups
over all those units which reside within some pre-defined neighborhoods, such as in a grid of size
n x n units (e.g., [8]).
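
Overlapping neighborhoods on a two-dimensional grid can be constructed as in the sketch below. It assumes a square grid with wrap-around and one neighborhood centered on every grid position; the stride of one, the default sizes and the function name are assumptions of this example.

```python
import numpy as np

def topographic_groups(grid=20, neighborhood=5):
    """Return an array (grid*grid, neighborhood**2) of filter indices per group."""
    idx = np.arange(grid * grid).reshape(grid, grid)     # filter index at each grid cell
    offsets = np.arange(neighborhood) - neighborhood // 2
    groups = []
    for r in range(grid):
        for c in range(grid):
            rows = (r + offsets) % grid                  # wrap around the grid edges
            cols = (c + offsets) % grid
            groups.append(idx[np.ix_(rows, cols)].ravel())
    return np.array(groups)

groups = topographic_groups(grid=20, neighborhood=5)
# how many groups each filter belongs to
membership = np.bincount(groups.ravel(), minlength=20 * 20)
```

With this layout every filter is shared by as many groups as there are units in a neighborhood, which is what forces neighboring groups to agree on the properties they share.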

Figure 6 shows the features learned from the van Hateren data (cf. Section 4) with patch size 16 x 16
and 99.9% variance retained after whitening. We used 400 filters that we arranged in a 2-D grid
(with wrap-around) using a group size of 5 x 5. We found that learning is simplified when we also
initialize the factor-to-mapping parameters so that each mapping unit has access to the filters of only
one group.

                         Translation                Rotation
Number of filters    Gating    Group Gating    Gating    Group Gating
81 (44)              89.69%    81.68%          90.08%    89.06%
225 (121)            92.40%    92.59%          91.50%    94.65%
441 (237)            92.20%    93.36%          90.76%    95.65%
729 (392)            91.44%    93.46%          91.71%    95.51%
900 (483)            91.29%    93.20%          89.95%    95.00%

Table 1: Classification accuracy of the standard and group-gating models on image pairs showing
translations and rotations.

Figure 6: Topographic filter maps learned by laying out features in two dimensions and allowing for
multiplicative interactions among nearby units. Left: Input filters, Right: Output filters.

The figure shows how learned filters are arranged in two topographic feature maps with slowly
varying frequency and orientation within each map. It also shows how low-frequency filters have
the tendency of being grouped together [8]. From the view of phase analysis (see Section 2), the
topographic organization is a natural result of imposing an additional constraint: Filter sets that
come in similar frequency, orientation and position are now forced to share coefficients with those
of neighboring groups. The simplest way to satisfy this constraint is by having nearby groups be
similar with respect to the properties they share. The apparent randomness of phase simply follows
from the requirement that multiple phases be present within each neighborhood. It is interesting to
note that using shared group structure is equivalent to letting units that are nearby in space affect
each other multiplicatively. Localized gating may provide a somewhat more plausible explanation
for the emergence of pinwheel-structures than squaring nonlinearities, which are used, for example,
in subspace models or topographic ICA (see [6, 7]).

5 Conclusions 

Energy mechanisms and "square-pooling" are common approaches to modeling feature dependencies
in sparse coding, and to learning group-structured or invariant dictionaries. In this work we
revisited group-structured sparse coding from the perspective of learning image motion and disparity
from local multiplicative interactions. Our work shows that the commonly observed constancy of
frequency and orientation of filter-groups in energy models can be explained as a result of
representing transformations with local, multiplicative feature gating. Furthermore, topographically
structured representations ("pinwheels") can emerge naturally as the result of binocular or
spatio-temporal learning that utilizes spatially constrained multiplicative interactions. Our work
may provide some support for the claim that localized multiplicative interactions are a biologically
plausible alternative to square pooling for implementing stereopsis and motion analysis [12]. It may
also help explain why the development of pinwheels in V1 may be tied to the presence of binocular
vision and why topographic organization of features does not appear to occur, for example, in
rodents [20].